Objective. The goal of this project is to implement the k-means clustering algorithm, a technique widely used in machine learning, in Python and apply it to data analysis. We write functions that use lists, sets, dictionaries, sorting, and graph data structures for computational problem-solving and analysis.


Part 1. Spotify API Data

Spotify is a popular audio streaming platform with an extensive music database. The Spotify API gives developers access to the platform's data, which offers insights into music listening habits around the world[1]. Using the API requires an initial setup: registering as a Spotify developer, creating an app, modifying the dashboard redirect URI, and storing the client ID and secret. After completing these steps, we have access to the Spotify API and all its features.

Get Playlist Data from API

First, we create a Client Credentials Flow manager, used for server-to-server authentication, by passing the client ID and client secret to spotipy's SpotifyClientCredentials class[2]. This authorization flow does not require user interaction.

import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

# Set client id and client secret
client_id = 'xxx'
client_secret = 'xxx'

# Spotify authentication
client_credentials_manager = SpotifyClientCredentials(client_id = client_id, client_secret = client_secret)
sp = spotipy.Spotify(client_credentials_manager = client_credentials_manager)
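Since the setup involves storing the client ID and secret, one option is to keep them out of the source file and read them from environment variables. The sketch below assumes the variable names SPOTIPY_CLIENT_ID and SPOTIPY_CLIENT_SECRET, which spotipy also looks up by default when no credentials are passed; `load_credentials` is a hypothetical helper, not part of spotipy.

```python
import os


def load_credentials(env=None):
    """Return (client_id, client_secret) read from environment variables.

    Hypothetical helper: `env` defaults to os.environ and can be replaced
    by any dict-like object, which makes the function easy to test.
    """
    env = os.environ if env is None else env
    client_id = env.get("SPOTIPY_CLIENT_ID")
    client_secret = env.get("SPOTIPY_CLIENT_SECRET")
    if not client_id or not client_secret:
        raise RuntimeError(
            "Set SPOTIPY_CLIENT_ID and SPOTIPY_CLIENT_SECRET before running."
        )
    return client_id, client_secret
```

The returned pair can then be passed to SpotifyClientCredentials in place of the hard-coded strings.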

Now we can get the full details of the tracks of a playlist based on a playlist ID, URI, or URL. Choose a specific playlist to analyze by copying the URL from the Spotify Player interface. Using that link, the following code uses the playlist_tracks method to retrieve a list of IDs and corresponding artists for each track from the playlist.

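# playlist_links is a list of playlist URLs copied from the Spotify player
track_ids, artist_ids = [], []

for link in playlist_links:
    # Keep the last path segment of the URL and drop the query string
    playlist_URI = link.split("/")[-1].split("?")[0]
    # Iterate over list of tracks in playlist
    for i in sp.playlist_tracks(playlist_URI)["items"]:
        track_ids.append(i["track"]["id"])  # Extract song id
        artist_ids.append(i["track"]["artists"][0]["uri"])  # Extract URI of the first listed artist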
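The playlist_tracks method returns one page of results at a time (up to 100 items by default), so a longer playlist needs paging. The following is a minimal sketch, assuming `sp` is an authenticated spotipy client; `playlist_id_from_link` and `all_playlist_items` are hypothetical helper names, and `sp.next` is spotipy's method for fetching the next page of a paged result.

```python
def playlist_id_from_link(link):
    """Extract the playlist ID from a share link by taking the last
    path segment and dropping any query string."""
    return link.split("/")[-1].split("?")[0]


def all_playlist_items(sp, playlist_id):
    """Collect every item of a playlist by following the paged results."""
    page = sp.playlist_tracks(playlist_id)
    items = list(page["items"])
    # Each page carries a "next" URL until the last page is reached
    while page.get("next"):
        page = sp.next(page)
        items.extend(page["items"])
    return items
```

The returned items have the same shape as those in the loop above, so the ID and artist extraction can be applied to them unchanged.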

Then, we write a function that takes the playlist data from the API and gets the metadata and audio characteristics of each track. Specifically, the function reads the query results for a playlist and returns the track name, track ID, artist, album, duration, popularity, artist popularity, artist genre, and audio characteristics for each track.

  • name: The name of the track.
  • album: The name of the album on which the track appears.
  • artist: The name of the artist who performed the track.
  • release_date: The date the album was first released.
  • length: The track length in milliseconds.
  • popularity: The popularity of the track calculated by an algorithm based on the total number of plays the track has had and how recent those plays are.
  • artist_pop: The popularity of the artist calculated from the popularity of all the artist’s tracks.
  • artist_genres: A list of the genres the artist is associated with.

Spotify Audio Features

Spotify’s audio features are precalculated measures of both low-level and high-level perceptual music qualities that help classify a track. A brief explanation of each feature, as given on the Spotify website, is shown below. More information on how to interpret these audio features is available in Spotify’s API documentation.

  • acousticness: A confidence measure of whether the track is acoustic.
  • danceability: Suitability for dancing based on tempo, rhythm, beat, and regularity.
  • energy: A perceptual measure of intensity and activity.
  • instrumentalness: Predicts whether a track contains no vocals.
  • liveness: Probability that the track was performed live.
  • loudness: Overall loudness of a track in decibels (dB).
  • speechiness: Detects the presence of spoken words in a track.
  • tempo: Estimated pace of a track in beats per minute (BPM).
  • valence: A measure describing the musical positiveness.
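The original function is not shown here, so the following is only a sketch of how such a lookup might be written. It assumes `sp` is an authenticated spotipy client and uses its track, artist, and audio_features methods; the name `track_features_sketch` and the constant `AUDIO_KEYS` are hypothetical, and the field names follow the two lists above.

```python
AUDIO_KEYS = [
    "acousticness", "danceability", "energy", "instrumentalness",
    "liveness", "loudness", "speechiness", "tempo", "valence",
]


def track_features_sketch(sp, track_id, artist_id):
    """Return one flat record of metadata and audio features for a track."""
    track = sp.track(track_id)
    artist = sp.artist(artist_id)
    # audio_features returns a list; an entry can be None when unavailable
    audio = sp.audio_features([track_id])[0] or {}

    record = {
        "name": track["name"],
        "album": track["album"]["name"],
        "artist": track["artists"][0]["name"],
        "release_date": track["album"]["release_date"],
        "length": track["duration_ms"],
        "popularity": track["popularity"],
        "artist_pop": artist["popularity"],
        "artist_genres": artist["genres"],
    }
    # Missing audio features are stored as None rather than dropped
    record.update({key: audio.get(key) for key in AUDIO_KEYS})
    return record
```

Returning a flat dictionary per track keeps the later step simple, since a list of such records maps directly onto the rows of a table.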

The following code loops through each track ID in the playlist and extracts the song information by calling the function we created. From there, we can create a dataframe by passing in the returned data using the pandas package.

# Loop over track ids
all_tracks = [playlist_features(track_ids[i], artist_ids[i], playlist_ids[i]) 
              for i in range(len(track_ids))]

| name | album | artist | release_date | length | popularity | artist_pop | artist_genres | acousticness | danceability | energy | instrumentalness | liveness | loudness | speechiness | tempo | valence |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2 AM | Pure Infinity | SwaVay | 2019-05-24 | 198577 | 54 | 51 | ['atl hip hop', 'indie hip hop', 'underground hip hop'] | 0.434 | 0.783 | 0.341 | 9.85e-05 | 0.362 | -12.353 | 0.0727 | 126.799 | 0.184 |
| Golden Child | Lady Wrangler | Shaboozey | 2018-10-05 | 177773 | 45 | 56 | ['pop rap'] | 0.362 | 0.792 | 0.591 | 1.90e-06 | 0.360 | -8.848 | 0.2900 | 151.029 | 0.365 |

Part 2. Similar Artists

First, we want to find the most frequently occurring artist in a given playlist. We use the value_counts function to get a sequence containing counts of unique values sorted in descending order.

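# Count distinct values in column
tallyArtists = df.value_counts(["artist", "artist_id"]).reset_index(name='counts')
# Rows are sorted by count in descending order, so position 0 holds the most
# frequent artist; position 1, used here, holds the second most frequent
topArtist = tallyArtists['artist_id'][1]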

| artist | artist_id | counts |
| --- | --- | --- |
| Juice WRLD | 4MCBfE4596Uoi2O4DtmEMz | 10 |
| Post Malone | 246dkjvS1zLTtiykXe5h60 | 8 |
| SAINt JHN | 0H39MdGGX6dbnnQPt6NQkZ | 3 |
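The same tally can be reproduced without pandas, which makes the counting and sorting explicit. The sketch below uses collections.Counter on a small made-up list of (artist, artist_id) pairs; the names and IDs are placeholders, not data from the playlist.

```python
from collections import Counter

# Hypothetical (artist, artist_id) pairs, one per playlist track
rows = [
    ("Artist A", "id_a"), ("Artist B", "id_b"), ("Artist A", "id_a"),
    ("Artist C", "id_c"), ("Artist A", "id_a"), ("Artist B", "id_b"),
]

tally = Counter(rows)
# most_common() sorts by count in descending order, like value_counts()
ranked = tally.most_common()
# The first entry is the most frequent (artist, artist_id) pair
top_artist_name, top_artist_id = ranked[0][0]
```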

We can retrieve an artist's details and related artists with the following code, passing the artist ID to the artist and artist_related_artists methods of the spotipy client. The returned list of similar artists is sorted by a similarity score based on listener data[3].

a = sp.artist(topArtist)
ra = sp.artist_related_artists(topArtist)

Below is a sample of the result when we query Spotify for the artists most similar to the playlist’s top artist. We store the result as a list of source and target artist IDs, which form the edges of the connection graph. We also retrieve data for the nodes of the graph, creating a list that holds information for each specified artist.

| source_name | source_id | target_name | target_id |
| --- | --- | --- | --- |
| Post Malone | 246dkjvS1zLTtiykXe5h60 | Rae Sremmurd | 7iZtZyCzp3LItcw1wtPI3D |
| Post Malone | 246dkjvS1zLTtiykXe5h60 | Huncho Jack | 6extd4B6hl8VTmnlhpl2bY |
| Post Malone | 246dkjvS1zLTtiykXe5h60 | Tyla Yaweh | 1MXZ0hsGic96dWRDKwAwdr |
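One way to build such a source/target list is a breadth-first walk that starts at the top artist and follows related artists outward. This is a sketch under stated assumptions rather than the code used for the table: `related_artist_edges`, `depth`, and `per_artist` are hypothetical names, `sp` is assumed to be an authenticated spotipy client, and artists are assumed to be identified by bare IDs.

```python
from collections import deque


def related_artist_edges(sp, seed_id, depth=2, per_artist=3):
    """Breadth-first walk over related artists.

    Returns a list of (source_id, target_id) pairs, expanding each
    artist at most once and stopping `depth` hops from the seed.
    """
    edges = []
    seen = {seed_id}
    queue = deque([(seed_id, 0)])
    while queue:
        source_id, level = queue.popleft()
        if level == depth:
            continue  # artists at the depth limit are not expanded further
        related = sp.artist_related_artists(source_id)["artists"][:per_artist]
        for artist in related:
            edges.append((source_id, artist["id"]))
            if artist["id"] not in seen:
                seen.add(artist["id"])
                queue.append((artist["id"], level + 1))
    return edges
```

The set of seen IDs doubles as the node list of the graph, and the edge list can be exported directly for a network visualization.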

Let’s see how things look when we pull in the full dataset, with each artist’s most similar artists and, in turn, each of their most similar artists. The following visualization is based on the Spotify Similar Artists API article and was created with Flourish Studio.

[Figure: similar-artists visualization, made with Flourish]

Part 4. K-Means Clustering

Next, we apply the K-Means clustering algorithm from the Scikit-Learn library to break down a playlist into several smaller playlists. This unsupervised learning algorithm divides the data points into k groups by assigning each point to the cluster whose centroid is nearest.

The first step is to choose an appropriate number of clusters (k). We use the Elbow Method to determine the optimal k, as shown below[5].

[Figure: elbow plot for a range of k values]

We tune the clustering algorithm by running K-Means for a range of k values, obtaining the above figure. The elbow of the curve suggests that k = 3 is a suitable choice in this case. Next, we call the K-Means function and set k to 3 clusters.
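To make the procedure concrete, here is a pure-Python sketch of the idea: a small implementation of Lloyd's algorithm, run for several values of k, recording the within-cluster sum of squared distances (the quantity scikit-learn exposes as `inertia_`). It is a stand-in for the scikit-learn call, not the code behind the figure; the function names and the toy points are made up for illustration.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iterations=50, seed=0):
    """Lloyd's algorithm; returns (labels, centroids, inertia)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its points
        new_centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centroids.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members))
                )
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    inertia = sum(
        squared_distance(p, centroids[label]) for p, label in zip(points, labels)
    )
    return labels, centroids, inertia


# Three made-up, well-separated groups of 2-D points
points = [
    (1.0, 1.0), (1.2, 0.8), (0.8, 1.1),
    (5.0, 5.0), (5.1, 4.9), (4.9, 5.2),
    (9.0, 1.0), (9.2, 1.1), (8.9, 0.8),
]

# Inertia for each candidate k; the elbow is where the decrease levels off
inertia_by_k = {k: kmeans(points, k)[2] for k in range(1, 7)}
for k, inertia in inertia_by_k.items():
    print(k, round(inertia, 3))
```

Because the result depends on the random starting centroids, a run can settle in a local minimum; repeating with several seeds and keeping the lowest inertia is a common remedy.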

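from sklearn.cluster import KMeans

# features holds the audio-feature columns used for clustering
model = KMeans(n_clusters = 3)
model = model.fit(features)
# Store each track's cluster label alongside its data
my_df['cluster'] = list(model.labels_)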

Considering that there are seven different audio features for the clustering task, we use principal component analysis (PCA) to reduce the dimensionality of the data into a more easily visualized set of variables.

from sklearn.decomposition import PCA
pca = PCA(n_components = 2)
pca_result = pca.fit_transform(features)

In the above code, we define a PCA instance to find two principal components determined from the features of the data. From there, we visualize the resulting clusters and explore the variation. The figure below shows our 3 clusters represented in 2-dimensional space.
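For intuition about what the two principal components are, the projection can be sketched in pure Python: center the data, compute the covariance matrix, and extract its two dominant eigenvectors by power iteration with deflation. This is an illustrative stand-in, not how scikit-learn computes PCA, and all function names are hypothetical. The sign of each component is arbitrary.

```python
import math
import random


def covariance_matrix(rows):
    """Return (sample covariance of the columns, centered rows)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]
    return cov, centered


def top_eigenvector(matrix, iterations=500, seed=0):
    """Power iteration: dominant eigenvector and eigenvalue of a symmetric matrix."""
    rng = random.Random(seed)
    d = len(matrix)
    v = [rng.random() + 0.1 for _ in range(d)]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break
        v = [x / norm for x in w]
    # Rayleigh quotient of the unit vector v gives the eigenvalue estimate
    eigenvalue = sum(
        v[i] * sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d)
    )
    return v, eigenvalue


def pca_2d(rows):
    """Project rows onto their first two principal components."""
    cov, centered = covariance_matrix(rows)
    d = len(cov)
    components = []
    for _ in range(2):
        v, eigenvalue = top_eigenvector(cov)
        components.append(v)
        # Deflation: remove the component just found from the covariance
        cov = [
            [cov[i][j] - eigenvalue * v[i] * v[j] for j in range(d)]
            for i in range(d)
        ]
    projected = [
        [sum(c[j] * comp[j] for j in range(d)) for comp in components]
        for c in centered
    ]
    return projected, components
```

Each row of `projected` holds the two coordinates that would be plotted for one track, with the cluster label used as the point color.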

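Mean audio feature values per cluster:

| cluster | acousticness | danceability | energy | instrumentalness | liveness | speechiness | valence |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 0.1193033 | 0.6751 | 0.6478 | 0.0002468313 | 0.1572567 | 0.1539933 | 0.5820333 |
| 2 | 0.5396875 | 0.737625 | 0.5372187 | 0.02251566 | 0.1513125 | 0.1588813 | 0.4577406 |
| 3 | 0.1411991 | 0.6402273 | 0.5815455 | 0.0002416193 | 0.1652682 | 0.1366114 | 0.2303227 |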

  • Cluster 1 has the highest energy and valence, indicating that these tracks are faster-paced, louder, and more positive (e.g., happy, cheerful, euphoric) than those in the other clusters.
  • Cluster 2 has the highest acousticness, with a mean value of 0.5396875 over all the cluster’s songs. Cluster 2 also has the highest danceability, indicating tracks whose tempo, rhythm, and beat are well suited to dancing.
  • Cluster 3 has the lowest valence, with a mean of 0.2303227, indicating more negative-sounding tracks (e.g., sad, frustrated, angry).
  • All the clusters have mean speechiness values below 0.33, indicating that the songs most likely represent music and other non-speech-like tracks.
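Per-cluster means like those interpreted above can be computed by grouping the rows by cluster label and averaging each feature. The sketch below does this in plain Python with made-up values; `cluster_means` is a hypothetical helper, and with pandas a group-by on the cluster column followed by a mean would serve the same purpose.

```python
from collections import defaultdict


def cluster_means(rows, labels):
    """Average each feature over the rows that share a cluster label."""
    grouped = defaultdict(list)
    for row, label in zip(rows, labels):
        grouped[label].append(row)
    return {
        label: {
            feature: sum(r[feature] for r in members) / len(members)
            for feature in members[0]
        }
        for label, members in sorted(grouped.items())
    }


# Made-up feature rows and labels, only to show the shape of the result
rows = [
    {"energy": 0.8, "valence": 0.6},
    {"energy": 0.6, "valence": 0.4},
    {"energy": 0.2, "valence": 0.1},
]
labels = [0, 0, 1]
means = cluster_means(rows, labels)
```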

References

[1] Web API Reference | Spotify for Developers, https://developer.spotify.com/documentation/web-api/reference/.
[2]
[3] E. Webb, Visualizing Rap Communities with Python & Spotify’s API, https://unboxed-analytics.com/data-technology/visualizing-rap-communities-wtih-python-spotifys-api/.
[4] Leonardo Mauro, Spotify Songs - Similarity Search, https://www.kaggle.com/code/leomauro/spotify-songs-similarity-search/notebook.
[5] Chingis Oinar, Separate Your Saved Songs on Spotify into Playlists of Similar Songs, https://towardsdatascience.com/cluster-your-liked-songs-on-spotify-into-playlists-of-similar-songs-66a244ba297e.